High-resolution freshwater dissolved calcium and pH data layers for Canada and the United States

Freshwater ecosystems are biologically important habitats that provide many ecosystem services. Calcium concentration and pH are two key variables that are linked to multiple chemical processes in these environments, influence the biology of organisms from diverse taxa, and can be important factors affecting the distribution of native and non-native species. However, it can be challenging to obtain high-resolution data for these variables at regional and national scales. To address this data gap, water quality data for lakes and rivers in Canada and the continental USA were compiled and used to generate high-resolution (10 × 10 km) interpolated raster layers, after comparing multiple spatial interpolation approaches. This is the first time that such data have been made available at this scale and resolution, providing a valuable resource for research, including projects evaluating risks from environmental change, pollution, and invasive species. This will aid the development of conservation and management strategies for these vital habitats.

section).For spatial interpolation, the median value for each site was used, since this measure is comparatively robust to outliers.Medians for each site were converted to spatial data and reprojected into the North America Albers Equal Area Conic projection, using the sf package 131 .

Spatial interpolation methods.
To select the approach used to generate the interpolated data layers, three interpolation methods were compared (Table 3): nearest neighbour (NN), inverse distance weighting (IDW), and ordinary kriging (OK).NN is the simplest method, providing a baseline against which the more advanced methods can be compared; each point for which an interpolated value was required was assigned the value from the closest available data point.IDW uses a combination of values from multiple data points, weighted by distance.For IDW, arbitrary or 'standard' values for nmax (the maximum number of points to be considered when predicting a value for a specific grid cell) and idp (the inverse distance power parameter, which controls how the weighting of data points varies with distance) are often used 132 .In this case, however, the optim function was used to find values of idp and nmax for the calcium and pH data which minimised two different error metrics, root-mean-square error (RMSE) and mean absolute error (MAE), during preliminary 5-fold cross-validation (Table 4).OK is a geostatistical technique, which uses a fitted model of the spatial autocorrelation among data points (a 'semi-variogram' or 'variogram') to derive the weights used for the interpolation of values to each grid cell.OK often generates superior results to IDW 133 , but this is not always the case 132,134 .An additional advantage of OK is that it generates a measure of statistical uncertainty (Kriging variance) for each interpolated value; this is not typically provided by other methods.Kriging variance is influenced by the distances from the interpolated points to locations with data, and by the spatial covariance relationship determined by the fitted variogram; greater variance indicates greater distance from measured values and thus greater uncertainty in the interpolated values.Variograms for each variable were fitted using the automap package 135 , which automatically selects   relevant models and parameter values that best fit the empirical variogram (Table 3), although constraints can be applied to the process.In this case, variograms were fitted with and without a fixed 'nugget' of zero, since manual setting of this parameter can sometimes be advantageous 136 .In all cases, OK was restricted to a nmax of 100 and nmin (the minimum number of data points to consider) of 15; changes to these numbers had little to no impact on the error metrics obtained during preliminary 5-fold cross-validation.Spatial models for all interpolation methods were fit using the gstat package 137 .
Leave-one-out cross-validation (LOOCV) was performed for each method to compare their predictive accuracies.This technique drops an individual point from the dataset and then uses the remaining data to interpolate a value for the location of the dropped data; this is repeated for all available data points.The interpolated values for each point were compared to the real measured values and used to calculate multiple performance metrics (Table 4): the correlation coefficient r; three absolute error measures, RMSE, MAE, and the mean bias error (MBE); and a measure of relative error, the median symmetric accuracy 138 (MSA).The interpolation methods were compared by considering their scores in these metrics.Initially, the intention was to use an 'objective' function 136 to integrate these metrics into a single performance score.However, this was not necessary, since for both calcium concentration and pH one method had the best scores in all key metrics (see Technical Validation).Since predictive accuracy of any interpolation method can vary spatially 132,136 , error metrics were also calculated using data from each individual province, territory and state.
On the basis of these comparisons a final interpolation method was selected for each variable.Calcium and pH values were then interpolated onto a grid with a cell size of 10 × 10 km 2 using the gstat::predict function and the resulting grids were converted to raster format 139 .Interpolated rasters were masked using outlines of Canada and the USA from the rnaturalearth package 140 .Rasters of Kriging variance for each variable were also generated at the same resolution.

Metric Formula Interpretation
Correlation coefficient, r Correlation between ′ Z i and Z i High correlation between observed and interpolated values suggests accurate interpolation Mean Absolute Error (MAE) A measure of the average magnitude of prediction errors Root Mean Square Error (RMSE) Another measure of the average prediction error.Compared to MAE, RMSE is more sensitive to large outliers.

Mean Bias Error (MBE)
By including the sign of individual residuals, MBE shows whether (on average) the interpolation tends to over-or under-predict Measure of 'typical' proportional error, symmetric and robust to outliers Table 4. Error metrics considered for comparison of interpolation methods.Z i is the observed value at a point, ′ Z i is the value interpolated for that point during LOOCV.

Data records
Project data are available at Data Dryad 141 .The data provided include the final interpolated rasters (and kriging variance rasters) for calcium and pH, the point data used for the interpolations (summary statistics for each site) and the underlying data for each site on each date (Table 5).Rasters were generated by Ordinary Kriging with a pre-defined zero nugget: for both variables this was the best-performing method (see Technical Validation).All rasters use the North America Albers Equal Area projection (ESRI:102008), have a resolution of 10 × 10 km, and have been provided in geotiff format.For each interpolation, the associated kriging variance rasters have been provided; these can be used to identify areas of higher uncertainty resulting from low availability of water quality data.Rasters are provided both 'masked' (using country outlines for the USA and Canada from the rnaturalearth package 140 such that values are only provided for land area) and 'unmasked' .The latter allow users to use their own territorial outlines for masking, to resample the rasters at different resolutions, or to reproject the rasters (for example into a latitude-longitude projection).It is advisable to perform these latter operations prior to masking the rasters with territory outlines.Some example R scripts to facilitate masking and reprojection can be found in the associated GitHub repository 142 (see Code Availability, below).
The 'sites' files contain the site data used for the interpolations (Table 5).This includes the following summary statistics for each site: median (used for the interpolations), number of dates with data, total number of records included, mean, standard deviation, minimum, maximum, 25 th percentile, 75 th percentile.Information on data sources and years with data is also included.The 'site-date' files contain summary data for each site on each date with available data: the median, mean, standard deviation, minimum, maximum, 25 th percentile, 75 th percentile, and number of records.Data sharing agreements with some organisations do not permit open sharing of their data and thus these records are not included in the databases (11,901 sites for calcium, 7601 sites for pH).For a small number of sites for which both public and proprietary data were available (calcium: 383 sites, pH: 61 sites), summary statistics have been recalculated using only public data; consequently, the values provided may not exactly match those used for the interpolations.The associated metadata files include full information on the contents of each of the data files.Finally, the source_ids.csvfile includes identifying information for the data sources included in the databases.
There are obvious spatial patterns in freshwater calcium concentrations across the continent (Fig. 1), reflecting the relationship with the chemical composition of the underlying bedrock.These include large areas of comparatively low calcium on the east and west coasts of Canada and the USA, and a large area corresponding to the Canadian Shield geological region.Calcium concentrations are comparatively high (30 mg L −1 or greater) across a continuous broad area running from the southern United States up to Yukon and Alaska.Areas of high and low calcium tend to correspond with areas of high and low pH (Fig. 2).

technical Validation
Calcium.The final calcium database used for the interpolations included records for 97,648 sites; the publicly shareable dataset includes 85,747 sites.Median calcium concentrations for individual sites ranged from 0.06 to 500 mg L −1 , but 95% of sites had median calcium concentrations of 115 mg L −1 or lower (Fig. 3).The highest concentrations of sites were mostly in the eastern United States and parts of southern Ontario, Quebec, and New Brunswick, while coverage was lowest in Alaska, the Canadian territories (Yukon, Northwest Territories, and Nunavut), and parts of northern Quebec (Fig. 4a).The majority of sites (56%) were sampled multiple times;    In most areas sites were, on average, sampled at least twice; however, there were areas of northern Ontario and Quebec where only a single data point was provided for most sites (Fig. 4b).Some of the provided data for this  area were already temporally-averaged values, so this does not mean that all data points for these areas were based on single measurements.Temporal variation in measurements from individual sites is to be expected as a result of measurement error and temporal change, such as seasonal fluctuations in calcium concentrations (Fig. 5).However, the scale of temporal variation at individual sites was generally smaller than the spatial variation among sites.The interquartile range for temporally-averaged calcium concentrations across all sites was 48.5 mg L −1 , while the median interquartile range for calcium measurements at individual sites was 5 mg L −1 , and 75% of sites had an interquartile range of 12.5 or less.
For the interpolation of calcium concentrations, the zero-nugget Kriging method (OK-ZN) had the highest r value and the lowest error metrics (excluding the proportional error, MSA, which was very similar to the lowest value); in particular, the bias error (MBE) was lower than the other methods (Table 6).The IDW interpolations, however, were not substantially worse.At the province, territory, and state level, the outcome was mostly similar: OK-ZN was the best or joint-best method in 53 out of 62 cases (Tables 7, 8); OK was slightly superior for four areas, the IDW methods (IDW-OR and IDW-OM) were superior for 4 areas, and in one US state NN was the best method.There appeared to be no tendency for the best approach to vary with number of data points in each area; OK-ZN was generally superior for states, provinces and territories with low (e.g., Mississippi, n = 75) and high (e.g., Florida, n = 13,591) numbers of data points.Consequently, the zero-nugget kriging interpolation (OK-ZN) was selected as the preferred interpolation method.
pH.The final pH database used for the interpolations included records for 208,784 sites; the publicly shareable dataset includes 201,183 sites.The median pH across all sampled sites was 7.9, and 95% of sites had a median pH between 5.4 and 8.74 (Fig. 6).Density of sites was high across much of the USA, with a considerably higher number of sites than for calcium (Fig. 7a).Coverage tended to be sparser for Canada, with some areas, such as northern Saskatchewan and northern Manitoba, having fewer sites with available data compared to calcium.Compared to the calcium data, a greater proportion (67%) of sites had data from more than one date, and sites tended to have data from a greater number of sampling dates (median dates sampled = 4, mean dates sampled = 17.2).However,   there were again areas of Quebec and Ontario where the data tended to be based on single values for each site (Fig. 7b).Temporal fluctuation at individual sites was also evident for pH (Fig. 8).However, the scale of temporal variation for individual sites was again smaller than the spatial variation among sites.The median interquartile range for pH measurements at individual sites was 0.3, with 75% of sites having an interquartile range of less than 0.46; the interquartile range across all sites (spatial variability) was 0.97.Error metrics for the pH interpolations were generally very low, including RMSE and MAE; this is to be expected, since the restricted range of feasible pH values makes extremely large errors impossible.While it is not valid to directly compare most metrics between interpolations with different scales and based on different data, it is worth noting that the proportional errors (MSA) were considerably lower for the pH interpolations compared to the calcium interpolations.It is important to be aware, however, that since pH is measured on a logarithmic scale, apparently small differences may have comparatively large physical and chemical implications.There was little variation in the accuracy of the different interpolation methods, with most error metrics being similar for most of the methods (Table 9).However, OK-ZN had the best (or equal-best) scores in every metric excluding the proportional error, which was very close to the lowest value.For individual provinces, territories and states, the situation was similar (Tables 7, 8); OK-ZN was the best or equal-best method in 46 cases, with IDW-OM / IDW-OR being slightly better for the others.Consequently, the zero-nugget kriging interpolation (OK-ZN) was selected as the preferred interpolation method.

State
Kriging variance maps generated by the selected interpolation methods can be used to identify areas of higher uncertainty in the interpolated values, and maps for the two variables show broadly similar patterns (Fig. 9).Across much of the USA and some of the Canadian provinces, there were high densities of sites (Figs.4a, 7a); kriging variance was lower in these areas, indicating comparatively lower uncertainty in the interpolated values.Variance, and therefore uncertainty, was highest in the northern areas of the continent, where there were fewer sites with data.Interpolated values in such areas should be treated with some caution, since they are more distant from locations with measured values.These areas could be prioritised for future sampling if more certainty is required for estimates of pH and calcium concentrations.

Usage Notes
Example application -invasive species risk assessment for dreissenid mussels.To illustrate the advantages of high-resolution calcium and pH data, Wilcox et al. 143 performed a continental-scale risk assessment for two species of invasive, freshwater dreissenid mussels using the new data layers.The two species, the zebra mussel Dreissena polymorpha and the quagga mussel D. rostriformis bugensis, have significant ecological and economic impacts on freshwater ecosystems in North America 144 , but can only survive, grow, and reproduce in waters with sufficiently high concentrations of dissolved calcium and within a particular pH range 5 .Due to the lack of high-resolution calcium and pH data, previous risk assessments 13,15 for these species have been limited to ecoregion-or sub-drainage-level resolution, and have not included Alaska and large areas of Canada (the Maritime provinces, Newfoundland and Labrador, and the Arctic).By combining the new calcium and pH data layers with additional high-resolution bioclimatic variables (e.g.temperature) from WorldClim 145 , Wilcox et al. 143 were able to model habitat suitability for both dreissenid species for the entire extent of Canada and the continental USA, assess the importance of calcium and pH relative to additional bioclimatic drivers of mussel distributions, and calculate the relative risk of invasion for every Canadian province and territory at a 10 km 2 resolution.
Limitations.'Big data' approaches can be a useful tool for water quality projects, but are not without limitations 146 , and the aggregation of many different data sources, with highly variable quality control standards, necessitates some care in their use.Given the extremely large number of data points involved, inspection of individual data records was not possible.Therefore, some of the data filtering and cleaning approaches may have resulted in the exclusion of some valid records.On the other hand, it is likely that some low-quality data remain in the final database used Fig. 8 Temporal trends in pH for ten sites, selected (from among the 100 sites with data for the most dates) to have the longest temporal coverage and to come from 10 different administrative regions; source region given for each plot, along with number of dates with data.Dotted lines mark the interquartile range for each site.Outliers have been removed for presentation, but were not excluded from calculation of site statistics.for the interpolations.For example, while certain sources provided enough information to be able to quickly screen out inappropriate sample types, most sources did not.Points with large or obvious errors in their values or positions were easy to identify, and therefore to remove or correct.Records with incorrect or inaccurate -but plausible -values or positions were effectively impossible to identify and remove.The large number of records used for the final calcium and pH databases should, however, minimise the impact of these types of error on the final interpolations.
There are also a few limitations to using spatial interpolation to create such large-scale maps of water quality variables.Calcium concentrations and pH are primarily driven by the underlying geology 147 ; transitions between underlying rock types can be relatively well delineated, and interpolation across such boundaries may produce results that do not reflect reality.This problem is likely to be minimised in areas with a high density of data points but may be important in data-poor regions.For example, there are relatively large swathes of northern Quebec and Arctic Canada for which no calcium or pH data were available.This may not be problematic in some contexts; for example, in the case of invasive species risk assessment, there are other factors (low temperature, remote location) that may make these areas low risk for many non-native organisms.Geological proxies could be used to predict calcium concentration in locations with no water quality data 148 , but this requires detailed geological information and validated mechanistic models and does not account for effects of plant cover and land use, which influence water chemistry 4,149 .Despite these limitations, geological data could be used to help improve predictions in areas with lower data coverage, or higher uncertainty, via co-kriging, which allows relationships with additional variables to be used during the interpolation process 150,151 .Alternatively, a range of machine learning approaches are available which are also able to use additional information, such as geological data and other environmental covariates; these methods can perform better than traditional geostatistical methods for generating spatial interpolations 152,153 , particularly when the density of data points for the primary variable of interest is low 154 .However, they do not always generate more accurate interpolations; a combined approach, which uses an ensemble of outputs from different interpolation methods with spatially-varying weightings dependant on density of available data, may result in better overall accuracy 136 .Such exercises are good candidates for future improvement to these data layers.

Fig. 1
Fig. 1 Interpolated freshwater calcium concentration map for Canada and the USA, generated using zeronugget Kriging interpolation.

Fig. 2 Fig. 3
Fig. 2 Interpolated freshwater pH map for Canada and the USA, generated using zero-nugget Kriging interpolation.

Fig. 5
Fig.5 Temporal variation in dissolved calcium concentration for ten sites, selected (from among the 100 sites with data for the most dates) to have the longest temporal coverage and to come from 10 different administrative regions; source region given for each plot, along with number of dates with data.Seasonal fluctations are evident for most of the sites, but are small compared to spatial variation across Canada and the USA.Dotted lines mark the interquartile range for each site.Outliers have been removed for presentation, but were not excluded from calculation of site statistics.Note the differences in vertical scales for individual plots.

Fig. 6
Fig. 6 Frequency distribution of site median pH values.

Table 1 .
Summary of data sources used.FAD = Federal Agency Database (Canada/US); PDR = Public Data Repository; PTA = Provincial or Territorial Agency(Canada); PRP = Peer-Reviewed Publication; OT = Other Organisation or Data Source.† data provided directly by agency contacts, not publicly accessible; ‡ includes data provided by agency contacts and publicly accessible agency databases; all other data were obtained from publicly accessible sources.

Table 2 .
Comparison of "Dissolved" calcium versus different calcium fractions measured for samples where measurements for more than one fraction were supplied in the source data.ρ = Pearson's correlation coefficient.

Table 3 .
List of Interpolation methods used, including parameter values (where applicable).

Table 5 .
141t of data provided in the associated data repository141.Masked rasters have been masked using country outlines such that values are only given for land area, and are therefore 'ready-to-use' .Note that databases (.csv files) only include data from shareable (open source) providers.

Table 7 .
Best -performing calcium and pH interpolation methods for each Province / Territory(Canada), according to LOOCV error score and correlation between observed and predicted values.n = number of sites.Methods: NN = Nearest Neighbour; IDW-OR = Inverse distance weighting, RMSE-optimised; IDW-OM = Inverse distance weighting, MAE-optimised; OK = Kriging; OK-ZN = Kriging, zero nugget.Where two methods are presented separated by '/' this indicates that the two methods performed equally well.Methods presented in parentheses were those with the highest correlation coefficient, r, in cases where this was not the same method with the best scores in the other error metrics.

Table 8 .
Best-performing calcium and pH interpolations for each State (USA).See Table7for abbreviations.